docs: add a tutorial on how to write a new health monitor#1428
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a README link to tutorials and a new end-to-end guide for writing, validating, packaging, and registering an NVSentinel health monitor, including protocol details, an example Go implementation, remediation rules, and CI/ko integration. ChangesHealth Monitor Tutorial
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
9329434 to
5aed4a5
Compare
|
🌿 Fern Docs Preview: https://nvidia-preview-pull-request-1428.docs.buildwithfern.com/nvsentinel |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 227-249: Validate the parsed POLL_INTERVAL_SECONDS value in the
health monitor startup flow before creating the ticker in the polling loop. In
the section that currently parses pollSeconds and later calls time.NewTicker,
add a guard to reject zero or negative values and return a clear error instead
of proceeding. Keep the fix near the existing POLL_INTERVAL_SECONDS handling so
the failure is caught before the ticker is constructed.
- Around line 520-530: The setup steps in the tutorial need to be reordered
because `go get github.com/nvidia/nvsentinel/commons@v0.0.0` will fail in a
fresh checkout until the local replacement is already in place. Update the
instructions in the `writing-a-health-monitor` section so the `go.mod` replace
for `github.com/nvidia/nvsentinel/commons` is added first (or via `go mod edit
-replace`) before running `go get`, keeping the dependency fetch step after the
replace is configured.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fd89c363-9c5b-4a74-9877-39fa927e7df7
📒 Files selected for processing (2)
docs/README.mddocs/tutorials/writing-a-health-monitor.md
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
89c9f74 to
ec0590a
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 15-19: The blockquotes in the tutorial markdown currently contain
blank lines that trigger markdownlint MD028, so tighten the copy in the affected
section and the repeated block around Appendix C by removing the internal blank
lines or splitting each quoted paragraph into separate blockquote blocks. Update
the relevant markdown text in the tutorial content so the quoted lines remain
readable but no longer have blank lines inside the same blockquote.
- Around line 263-265: The trigger-file check is treating all stat failures as
“healthy,” which can hide permission or I/O problems and clear an active fault;
update the fileExists helper and the demo check logic so only os.IsNotExist is
interpreted as the file being absent. Return the error from fileExists instead
of collapsing every os.Stat failure to false, and in the demo check path that
logs with slog.Info, propagate non-not-exist errors as unhealthy rather than
deriving healthy solely from fileExists. Apply the same fix to both occurrences
referenced by the demo snippets so the health monitor behavior is consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 62ef8d12-b0b4-4b72-9c3c-2e94b6c95a67
📒 Files selected for processing (2)
docs/README.mddocs/tutorials/writing-a-health-monitor.md
✅ Files skipped from review due to trivial changes (1)
- docs/README.md
64487fd to
02e66f5
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/tutorials/writing-a-health-monitor.md (1)
40-40: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winAlign the socket path with the mount.
The diagram says the monitor talks to
unix:///var/run/nvsentinel.sock, but the deploy step mountshostPath/var/run/nvsentinelinto/var/run. That only works if the socket actually lives under a nested directory, so readers will end up wiring the wrong path. Please make the mount and URI agree.Also applies to: 368-369
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tutorials/writing-a-health-monitor.md` at line 40, The diagram and deploy instructions use mismatched socket paths, so update the health monitor docs to make the mount target and the gRPC URI refer to the same location. Adjust the wording and diagram in the writing-a-health-monitor tutorial around the monitor-to-Platform Connector connection so the socket path shown for the monitor matches the hostPath mount used in the deployment step.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Line 40: The diagram and deploy instructions use mismatched socket paths, so
update the health monitor docs to make the mount target and the gRPC URI refer
to the same location. Adjust the wording and diagram in the
writing-a-health-monitor tutorial around the monitor-to-Platform Connector
connection so the socket path shown for the monitor matches the hostPath mount
used in the deployment step.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f8a3d558-7b55-443b-920b-297b665bae97
📒 Files selected for processing (1)
docs/tutorials/writing-a-health-monitor.md
02e66f5 to
1105a3a
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
docs/tutorials/writing-a-health-monitor.md (1)
152-173: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winClarify the
go.modscaffold wording.This says "two local replace directives", but the snippet only adds
data-modelshere;commonsis introduced later in section 8. As written, it reads like a missing step.🛠 Suggested wording
-Create `go.mod` with the two local replace directives every monitor uses (the module -path is resolved locally, not from a registry): +Create `go.mod` with the local replace directive needed for the demo module (the +module path is resolved locally, not from a registry):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tutorials/writing-a-health-monitor.md` around lines 152 - 173, The `go.mod` scaffold wording is inconsistent with the snippet because it claims “two local replace directives” even though this section only adds the `data-models` replace and introduces `commons` later in section 8. Update the tutorial text around the `go.mod` example to say this scaffold includes the local replace for `data-models` only, and mention that the `commons` replace is added later when the production publisher is introduced, so readers do not think a step is missing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 626-651: The publish example currently introduces fakePC but does
not use it, and the guidance about healthpub.New is incomplete. Update the
tutorial section around fakePC, TestBuildEvent_Unhealthy, and the Publish
example so the fake client is actually used when demonstrating delivery through
healthpub, or remove the fakePC mention entirely if only buildEvent is being
tested. Make the example consistent with the non-unix target behavior of
healthpub.New so the fakePC path is exercised instead of left unused.
---
Nitpick comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 152-173: The `go.mod` scaffold wording is inconsistent with the
snippet because it claims “two local replace directives” even though this
section only adds the `data-models` replace and introduces `commons` later in
section 8. Update the tutorial text around the `go.mod` example to say this
scaffold includes the local replace for `data-models` only, and mention that the
`commons` replace is added later when the production publisher is introduced, so
readers do not think a step is missing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 15e923c0-04d0-4d5e-9098-d2553ea37bdd
📒 Files selected for processing (2)
docs/README.mddocs/tutorials/writing-a-health-monitor.md
✅ Files skipped from review due to trivial changes (1)
- docs/README.md
1105a3a to
c7c3147
Compare
Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
c7c3147 to
fb27671
Compare
fb27671 to
294532a
Compare
ae4c4a2 to
837cc0b
Compare
be62467 to
47a1698
Compare
Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
8428c6e to
b453a86
Compare
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
7691e45 to
b453a86
Compare
Summary
Adds a complete, developer-facing guide for writing NVSentinel health monitors, and its one-shot AI prompt.
Testing
Validated one-shot AI prompt end-to-end on a local KIND cluster via Tilt.
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit